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ABSTRACT 

Motivation: Approaches that use supervised machine learning 
techniques for protein-protein interaction (PPI) prediction typically 
use features obtained by integrating several sources of data. Often 
certain attributes of the data are not available, resulting in missing 
values. In particular, our host-pathogen PPI datasets have a large 
fraction, in the range of 58-85% of missing values, which makes it 
challenging to apply machine learning algorithms. 
Results: We show that specialized techniques for missing value 
imputation can improve the performance of the models significantly. 
We use cross species information in combination with machine 
learning techniques like Group lasso with regularization. We 
demonstrate the benefits of our approach on two PPI prediction 
problems. In our first example of Sa/mone//a-human PPI prediction, 
we are able to obtain high prediction accuracies with 77.6% precision 
and 84% recall. Comparison with various other techniques shows 
an improvement of 9 in F1 score over the next best technique. We 
also apply our method to Vers/n/a-human PPI prediction successfully, 
demonstrating the generality of our approach. 
Availability: Predicted interactions, datasets, features are available 
at: |http://www.cs. cmu.edu/~mkshirsa/eccb201 2_paper46.html.| 
Contact: judithks@cs.cmu.edu 

Supplementary Information: [Supplementary data] are available at 
Bioinformatics online. 



1 INTRODUCTION 

Identification of protein-protein interactions (PPIs) between a 
pathogen and its hosts will not only give us insights into the 
molecular mechanisms underlying pathogenisis but also provide us 
with potential targets aiding in the development of pathogen-specific 
therapeutics. Computational methods complement experimental 
methods in studying PPIs by predicting highly probable interactions, 
which are used by experimentalists to filter out unlikely interactions 
from the prohibitively large set of potential interactions that need 
to be tested. In particular, supervised machine learning methods use 
the existing known interactions as training data and formulate the 
interaction prediction problem in a classification setting, with target 
classes: 'interacting' or 'non-interacting'. Features are derived on 
the data by integrating various attributes of proteins, which include 
protein sequences f rom Unipro t i UniProt Consortiumll201 lb . protein 
family from Pfam l lFinn et a/.Ll201fj|) . protein st ructure from PDB, 
gene ontology (GO) from the GO database ( Ashburner et all 
l200fj|) . gene expression from GEO jBarrett et a/.Ll2*0*l*l[) . interactions 



between protein families from iPfam i Finn et all |2005|) . protein 
domain interactions from 3DID dStein etal. , 2011 ). to name a few. 

Although these attributes are available for some proteins, many of 
them remain unknown for various reasons: they are a part of ongoing 
experimental studies, difficulty of experimentation due to the 
nature of the protein, lack of efficient high-throughput techniques, 
limited pace of manual curation from published literature, etc. 
Such unavailable information results in missing values in the 
dataset. Missing values make the application of any machine 
learning algorithm and data analysis technique difficult, as most 
approaches were designed to work on complete or nearly complete 
data. Imputation of missing values involves exploiting available 
information about the data in order to best estimate the missing 
entries. Thus, the choice of an appropriate missing value imputation 
method is very important in order to build good predictive models. 

Here, we develop methods to deal with missing values for host- 
pathogen PPI prediction. Prior work on PPI prediction within the 
supervised learning fra mework deals wi t h mis sing values in one 
of the following ways. i Mohamed et flZl l2010h_ eliminate all data 
with missing values. JOi et all 120051) and dSingh et all 120061) 
use a Random Forest (RF) cla ssifier-induced imputation originally 
introduced by l lBreimanil200 ll ). which uses mean values of features 
in combination with proximity to other inst ances. The abstainin 
varian t of Adaboost originall y introduced by dSchapire and Singe 
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varian t 01 Adaboost originall y introduced by 1 achapire and aingcu . 
1 19991) has been applied by JsmeraldF et q/.Ll2010h . The work by 
l Oi et aZ.I . I2007T) adds one extra binary 'ind icator' feature for every 
existing feature to indicate its availability. dDver et o/ll201lh limit 
the data sources used to generate features so as to exclude the 
possibility of any missing values. 

As the main application of our techniques, we consider 
Salmonella-human PPI prediction, which is an important step 
towards developing our understanding of Salmonello-sis, a disease 
that causes millio ns of infection s and thousands of deaths every 
year world-wide iGraharrJ, I20T0I ). The bacterium Salmonella has 
4533 protein sequences in the reference proteome set in Uniprot 
database, known molecular functions for 1058 proteins in GO, 
protein structures for 592 proteins in PDB and protein family 
information for 2978 proteins in Pfam database. Approaches that 
construct features integrating data from these databases will thus 
have many missing values since not all attributes are available for all 
proteins across all databases. In the example dataset of Salmonella- 
human PPI that has 62 known interactions, with our feature set we 
find that «a58% of the interactions have at least one feature with 
missing values. 

To cope with missing values, we can apply standard statistical 
imputation methods or develop application-specific methods that 
take advantage of the particular properties of the data being imputed. 
Our work falls in the second category and is novel in its use of 
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data from other related species, in addition to information from the 
species of interest itself. To transfer this information from other 
closely related species to the species of interest, we use protein 
sequence alignment to define a measure of similarity between the 
proteins from the two species. We use ideas from nearest-neighbour- 
based methods to combine such cross-species data. Our techniques 
have several benefits: 

1 . PPI datasets can have a very high missing value rate. Methods 
that use the limited available features to impute a large number 
of missing features are likely to produce very biased models. 
Our method uses cross-species data in addition to the available 
features thus reducing this bias. 

2. We make no independence assumptions on the features. 
Previous work in imputation for PPI assumes that all features 
are independent of each other. However, this is not always 
true, e.g., gene expression data are a time series of expression 
measurements and features derived from such data cannot be 
considered independent. 

3. We avoid explicitly estimating the density of very high 
dimensional features in order to impute the values. 

The method we present is general and can be used whenever there 
is available data from a related distribution. The proposed use of 
abundantly available cross-species data is analogous to sampling the 
missing values of a feature from another related feature distribution. 
We demonstrate the generality of our approach by successfully 
extending our modelling efforts to predict PPIs in Yersinia pestis. 

2 APPROACH 

2.1 Problem setup 

We describe the problem setup in the context of Salmonella- 
human PPI prediction; however, the ideas are similar for other 
host-pathogen PPI prediction tasks. We use a binary classification 
framework defining an 'instance' to be a pair of proteins <p s ,p n >, 
where one protein is the pathogen protein 'p s ' (i.e. Salmonella) and 
the other is the host protein 'p n ' (i.e. human). Features are defined 
for every protein-pair using various properties of the individual 
proteins and combining them all into a single feature vector. Details 
of our feature-set are listed in Table[2] These features are used with 
discriminative classifiers like support vector machine (SVMs). 

We used the list of 62 interacting protein-pairs reported in 
ISchlckcref «/.Ll2012l) — these data form the 'interacting' or 'positive' 
class. Since there is no experimental evidence about proteins that 
do not interact, we construct the 'non-interacting' or 'negative' 
class using a technique commonly used in PPI prediction — using 
random pairs of proteins obtained by sampling the set of all possible 
Salmonella-hmnw protein pairs. Our dataset had 3592 unique 
Salmonella^ and 24431 unique human proteins, resulting in «90 
million protein pairs. The missing value analysis we present in 
Section l2~2l was done on a small random subset 'M' of size 40000 
pairs that includes all interacting pairs. These data were used to 
decide on imputation strategy and not our training data — the details 
of training and test sets can be found in Section 3. 
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2.2 Mechanism of the missing values 

Let Xj = (Xjj), where j = 1 , . . .p, be the feature vector corresponding 
to the rth protein pair. The p features can be considered random 
variables that have either been o bserved or missing. The seminal 
work by c and Rubinl 1 19871) identifies three broad categories 
of missing value mechanisms that are briefly listed below. Please 
refer to the original work for examples and details of each. 

(1) Missing completely at random (MCAR): if the probability that 
an observation Xy is missing is unrelated to its value or to the 
value of any other variable [X^ :kj^j}. 

(2) Missing at random (MAR): if the missingness of an 
observation Xy is independent of its value, given the value 
of some other observed variables in [X^ :k^j}. 

(3) Missing not at random (MNAR): if the missingness depends 
both on the missing value itself and other observed variables. 

The missingness mechanism in PPI datasets is MAR. Eliminating 
data that have any missing values, an approach called 'listwise 
deletion', has been followed in some prior work. However, it is 
appropriate only for data that is MCAR, otherwise it leads to biased 
estimates and models. Our claim of MAR arises from the observation 
that the absence of features for a given protein pair depends 
on how well studied the individual proteins are. An approximate 
measure of 'well-studiedness' for a protein can be obtained from 
UniprotKB, which defines an attribute called 'protein existence' that 
indicates the type of evidence supporting the existence of the protein. 
There are five categories of evidence: (1) evidence at protein level, 
(2) evidence at transcript level, (3) inferred from homology, (4) 
predicted and (5) 'uncertain'. It is clear that proteins with status (1) 
or (2) can be considered to be well studied while the rest are not. 

Let this 'existence' status of proteins be two additional observed 
features E\ and E^, then we can say that the missingness of other 
features in Xi approximately depends on E\ and Ei . It is independent 
of the actual value of every missing feature, thereby giving us an 
MAR mechanism. We observed that in the set 'M' described in 
Section 12.11 the missingness of various features correlated well 
with Ei and E2 taking values of (3)-(5). Note how in this case, 
the removal of all data that have missing features will leave behind 
protein pairs that mostly involve well-studied proteins. A model built 
on such data will be undesirable as it will not glean any information 
about the less studied protein-pairs. 

2.3 Imputation of missing values 

Table Q] shows the percentage of missing values in our dataset, over 
all features and over individual feature types. The fraction of overall 
missing data is very high, about 58% for the positive class and 81% 
for the negative class. Furthermore, the missingness is highly skewed 
between the two classes, being significantly higher in the negative 
class data — this limits the choice of imputation techniques that can 
be used. Techniques like mean-value imputation can produce models 
that are biased towards predicting all data with the imputed features 
as 'negative'. We also want to point out that high-dimensional 
features like the GO features are very difficult to impute using purely 
statistical modelling-based approaches that are very inefficient when 
the number of parameters to estimate is huge (^180 million for the 
GO features). 
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Table 1. Percentage of examples with missing values 
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58.0 


81.2 


Gene Exp. GDS77 


21.1 


34.6 


Gene Exp. GDS78 


21.1 


31.4 


Gene Exp. GDS80 


22.6 


30.5 


GO component 


19.4 


52.0 


GO function 


5.0 


37.8 


GO process 


21.2 


45.2 



Shown over all features and also individually for the important features. Observations: 
the overall proportion of missing data is very high; the extent of missing values is highly 
skewed between the interacting and random pairs. 



Considering these issues, we use a localized nearest-neighbour 
like approach that uses protein sequence similarity as the basis to 
compute the locality. The approach works in two phases: (i) cross- 
species information integration and (ii) modelling based imputation. 
The first component modifies the data source before feature 
computation by supplementing it with cross-species information. 
The latter uses modelling techniques to estimate feature values. 

2.3.1 Cross-species information integration This technique relies 
on the observation that certain attributes of proteins like GO 
annotations and protein domains are similar for homologous proteins 
across various species. Whenever these properties are largely 
unavailable for any organism, we can exploit the vast amounts 
of equivalent information available in closely related organisms. 
For instance, the bacterium Salmonella is very similar to some 
other Gram-negative bacteria such as Escherichia coli, Francisella 
tularensis, Yersinia pestis, Y.pestis, etc. Some of these bacteria have 
been better studied and are likely to have more properties available 
than Salmonella. We illustrate the case of GO attributes in the context 
of two bacteria. The grey curve in Figure[T]shows the similarity in the 
GO term^l between Salmonella and E.coli genes. The curve plots for 
every Salmonella gene (along x-axis), the overlap in the GO terms 
with the most homologous E.coli gene (along v-axis). Homologou^l 
gene pairs were obtained using BLAST alignment (e-value <0.01). 
This GO-overlap was computed only for Salmonella genes that had 
at least one GO annotation. Of these genes, 20% did not have any 
homologues in E.coli and are shown in the curve with the overlap 
value of 0. About 68% of the Salmonella genes have a GO-overlap of 
>50% with the corresponding homologous E.coli gene. We perform 
a similar analysis on protein family information (Pfam) for the two 
bacteria; the overlap is shown in Figure [T] (solid line). 

Imputing the GO features: We used the above approach 
to impute GO annotations in both Salmonella and human proteins. 
For Salmonella proteins, we find proteins in all other bacteria 
which cross the BLAST e-value cutoff of 0.01 and use their GO 
annotations; whereas for human proteins, we use the GO annotations 
of aligning proteins from mouse and rat. The imputed value of a GO 
feature for a protein pair <p s ,p n > where say the GO annotation of 



2 Terms from the three ontologies: cellular component, molecular function 

and biological process were combined for this analysis. 

'Overlap = intersection of the two sets of GO terms or Pfam ids. 

4 We use 'homologue' in a loose sense for sequences similar to each other. 



Fig. 1. Overlap in GO terms for two closely related bacteria: Salmonella and 
E.coli. Each point on the curve represents the GO overlap of a Salmonella 
gene with the most homologous E.coli gene. Of all gene pairs, 68% show a 
high overlap in GO terms and 79% show a high overlap in Pfam 

p s is missing, uses the linear combination 

fGO(Ps,Ph)= ^2 «fc/GO(Pb.Ph). 
Vp b :hom(p s ,p b ) 

where fGO(Pb>Ph) computes the GO similarity feature between 
the proteins p^ and p n using the method listed in Table [2] the 
function hom(., .) checks if bacterial protein 'pb' is homologous to 
the Salmonella protein p s , and at, e [0, 1] is the normalized similarity 
between p s and pj>. We use the BLAST alignment e-values to 
compute «[,. Other cases where only p n , or both p s and p n are 
missing GO annotations are computed in an analogous manner. 

The goal of our imputation method is to obtain a higher coverage 
over the missing data, at the same time maintaining good prediction 
accuracy. The e-value cutoff of 0.01 gave a good predictive 
performance. We also tried a lower threshold of 0.0001 and the 
results are discussed in Section 4. 

Imputing the gene expression features: This is based on the 
premise that genes whose protein products are similar are likely to 
have a similar expression pattern. The imputed value for a protein 
pair <p s ,p n > would be computed as 

/ge(Ps,Ph) = S ene - ex P(Ph)= ^2 «*r-gene_exp(p j(: ), 
Vp k :hom(p h ,p k ) 

where 'gene_exp()' represents the expression of the gene encoding 
the given protein, hom(-,-) and are defined as before. 

2.3.2 Modelling-based imputation The above technique is able to 
transfer a large number of cross-species attributes, thereby reducing 
the fraction of missing data ranging from 58 to 80% to about 
10-15% in both positive and negative class data. The proteins with 
the remaining missing attributes are either species-specific and do 
not have any homologues to transfer information from or have 
homologues which are also missing the attributes of interest. To 
deal with this much smaller fraction of remaining missing data, we 
use modelling-based imputation techniques. 

Modelling gene expression features: Gene expression data are 
time-series data with several measurements spread across time or 
across different types of control experiments. Each expression value 
is not independent of the others, and it is important to model their 
distribution jointly. jLiew et o/ll20 1 d) give a good review of missing 
value imputation for gene expression data. Our approach is different 
since it is related to PPIs. In addition to the gene expression data 
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Table 2. Feature set: summary of the various categories of features and the number of features in each category 



Feature name (count) Description 



Gene ontology a (~177 million) (i) Pairwise similarity: computed between GO terms of p s and ph. Let S= set of all GO terms, S p = set of GO terms for 

protein 'p' . We set the entries of S x S corresponding to all pairs of GO terms from S Ps x S Ph to the similarity between the 
GO term pairs. Similarity between two individual GO-terms was computed using G-Sesame. 
(ii) Neighbour similarity: computed analogously using GO terms of p s and binding partners of p h in the human 
interactome. Total number of features from both categories = 2 \S\-\S\ 



Network-based (3) 


Uses three graph properties of pi, in the human protein interaction network: (i) 'degree' = number of neighbours of ph; 

(ii) 'clustering coefficient' = ratio of edges present amongst neighbours of ph to all possible edges between them; 

(iii) 'centrality' = fraction of shortest paths in the network that pass through pi, 


Gene expression (7) 


Derived using the gene of the human protein ph. Uses three GEO datasets: GDS77, GDS78, GDS80 reporting differential 
gene expression of human genes infected by Salmonella, under seven different control conditions. 


Conserved pathways (2) 


We look for interactions <pb,Ph 2 > in other known bacteria-human PPI datasets, where bacterial protein pb is a 
homologue of p s , and human proteins ph and ph 2 share a pathway. Two such features were defined: (i) GP: aggregates 
bacteria-human PPI datasets across Gram-positive bacteria and (ii) GN: aggregated over Gram-negative bacteria. 
Let P c = number of human pathways shared by two human proteins; Hom(p\,p2) = 1 if the two proteins pi,pi are 
homologous, 0 otherwise. Then, / cp (p s ,ph)= ^ Hom(p bi ,p s )P c (Ph i ,Ph); 

(Pbj.Phj) 


RNAi expression (2) 


Uses human senes teemed as 'hits' bv the RNAi screens rjublished in jMisselwitz et a/.Ll201lh. Defines two features: (i) 
pathway-enrichment: feature is 'on' if ph belongs to a pathway statistically enriched by the RNAi hit genes, (ii) 
complex-enrichment: 'on' if ph belongs to statistically enriched protein complexes. Enrichment analysis used the 
hypergeometric test with p-value cutoff of 0.01. 


Interologues (1) 


Number of protein-pairs from other species that are interologs of the given pair <p s ,ph>. 


Sequence n-grams (39 200) 


Used the 'conioint triad model' JShen et «/.Ll2007l) to set n-sram features on the protein seauence for n = 2, 3, 4, 5. The 
amino acid sequence is first converted to class sequence and n-grams are computed separately for p s and Ph and then 
concatenated to sive a single feature vector similar tolDver et a/.N2()llh of size =2(7 2 + 7 3 + 7 4 + 7 5 ). 


Pfam interactions (1) 


Counts the fraction of all possible interactions between the Pfam families of p s and p^ that are listed as known 
interactions in the iPfam database iFinn et a/.Ll2005b 


Domain interactions (1) 


Similar to the above feature, computes the fraction of all possible domain-domain interactions between p s and ph that are 
present in the domain interactions database 3DID dStein et «/.Ll201ll). 


Protein sequence (2) 


(i) Log of the e-value from the sequence alignment between proteins p s and Ph, computed using PSI-BLAST 
iAltschul et alW 19971). (ii) Los of the e-value of the seauence alignment between p< and Ph's bindins partners in the 
human interactome. 

The maximum value is taken over all the binding partners. 


Pl, represents the human protein, and p s represents the Salmonella protein in a given protein pair <p s ,p h >. 
a sparse features, i.e. only some of the millions of features are active in a single protein-pair. 


itself, we use protein sequence data to impute missing values. To Y el" xt refer to k time-series gene expression values for the n 
use protein sequence data, we convert each sequence to a feature proteins. We have 2793 sequence features that make d = 2793, and 
vector usina the 'conjoint triad' conceDt from dShen et al[ 120071). the set of all human proteins has n & 25 000 proteins. One of our gene 
They partition the amino acids into seven classes based on their expression datasets 'GDS78' downloaded from GEO jBarrett et all 
properties and replace every amino acid in the protein seauence bv 1201 lb, has three time-series measurements for each gene: thus k = 3. 
the appropriate class. We build n-gram features on this modified Since we want the response Y to be jointly dependent on a subset of 
sequence for n = 2,3,4. This gives us 2793 features. Note how the covariates X, we use the block regularization regression model 
these features encode the protein seauence at a much coarser level; from the work bv dObozinski et a/.Ll200&1), also called 'Group Lasso 
proteins with similar features are likely to have similar properties. with l\ll2 regularization'. The sparsity constraints due to the l\ 
Given the observed gene expression data and sequence features norm in the formulation allow us to select a subset of the covariates 
for every protein, we model this imputation problem as a multi- (i.e. the sequence features) to regress upon. The I2 norm allows 
variate regression problem: Y — XB + e. The covariates XeM. nxd for sharing of this covariate structure across the different response 
are the sequence features for n proteins and the response variables variables (i.e. sharing across the different time points). The details 
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of this model, formulation and implementation are described in 
Subsection l3.ll 

We also tried to regress each of the individual time points Yj 
separately using simple linear regression and squared error loss 
with lasso regularization. However, in both these models 
the independence assumption between the response variables Y — 
{Yj} resulted in undesirable models. In order to illustrate that 
this imputation indeed improves the performance, we trained two 
models — one that discards examples with missing gene expression 
values, and the other that uses this regression-based imputation. We 
found an improvement of about 4% in Fl. 

3 METHODS 

3.1 Imputation using group lasso with I1/I2 penalty 

Group lasso with I1H2 regularization lObozinski et all I2OO8I) is a way 
to perform high-dimensional multivariate linear regression in the scenario 
where the set of all response variables is to be regressed on the same set of 
covariates. Let YeM. nxk be the matrix of real-valued observations, where 
each of the n observations comprises k responses. Let XeW xd be the 
design matrix with the set of d covariates for each of the n observations. 
The multivariate linear regression model takes the form 

Y=XB* + €, 

where B*eR dxk is the regression coefficients matrix and t eW xk is the 
noise matrix with zero mean. B* has d coefficients corresponding to each 
of the k response variables 7;. For high-dimensional problems, a sparsity 
condition is assumed meaning that 'most' of the d covariates have zero 
weights across all the tasks. A natural way to solve for the coefficient 
matrix B* is to formulate the regression as an optimization problem with 
the objective of minimizing the squared error, at the same time maintaining 
the sparsity of B* as shown below: 

B* = arg min \ ^\\Y -XB\\ 2 F + k\\B\\ h/h \ , (1) 

where B=(fjy) with l<i<d, l<j<k; rows representing the d covariates and 
columns representing the k responses and ||B||;j/; 2 =X!;=i II Alb i s the block 
tl/f-2 norm; the norm \\.\\p is the Frobenius norm of a matrix computed as 
the sum of squares of all individual elements of the matrix. 

The Ci/t2 norm has several nice properties. The I2 norm couples together 
the columns of the coefficient matrix, as a consequence of which coefficients 
for a particular covariate jointly remain non-zero across all k responses. In 
our context this implies that, a particular sequence feature will be non-zero or 
active for the entire gene expression time series. Also, the i\ norm decouples 
the rows of the coefficient matrix B* and adds sparsity which results in some 
of the d rows getting selected as the important covariates and the remaining 
going to zero. This causes some of the sequence features getting selected as 
the important ones. 

The optimization problem in Equation (1) is convex and is solved by using 
the dual aug mented Lagrangian method for e fficient sparse reconstruction, 
propo sed by jTomioka and Sugivamal [20091). We use the imp lementation 
from JPunivani et a/.Ll20ld) and iTomioka and Sugivamall2009l) . Parameters 
were tuned using 3-fold cross validation. 

3.2 Dataset and training 

We use the 62 Salmonella-human PPIs reported in <SchlekerefaZLl2012h as 
the 'positive class' in our gold standard data. Call this set P. Let 'U' be the 
set of protein pairs obtained by pairing up all Salmonella proteins with all 
human proteins. Our proteins data described in Section lTTl gave us \ U\ ~90 
million pairs. We sample some random pairs from the set U\P, to derive the 
set N that will be the negative class. The number of random pairs chosen to 
be the negative class is decided by what we expect the interaction ratio to 



be. We chose a ratio of 1 : 100 meaning that we expect 1 in every 100 random 
pairs ofjroteins^ojnterac^vifii^ac^ 

et al. l2010l) , lTastan et all b009h . Our training data D = PUN thus had 62 
positives and 6200 negatives. To generate novel potential interactions given 
the gold-standard data, we apply the classifier model built on D to the rest 
of the protein pairs: (U\D) and report the positive pairs predicted by the 
classifier. 

3.3 Classifiers for learning 

In ou r techniques, we trie d SVM jBoser e/oZll 19921) and Logistic regression 
(LR) iHastie et allfiwk with 1 1 and I2 regularization. A linear kernel was 
used for SVM experiments. Since our feature set is very high dimensional, the 
l\ regularization-based methods were the most efficient. We found that LR 
with l\ regularization (LR LI) had the best accuracy in all our experiments. 
This is explained by the fact that our feature set is very sparse and 1 1 is known 
to learn sparse h ypotheses well. We used the Liblinear implementation by 
<Fanef q/lEool for both SVM and LR. 

3.4 Experimental setup 

Our evaluation criteria were chosen considering the characteristics of our 
dataset: high class imbalance and small number of labelled examples. Instead 
of using accuracy that measures performance over both the classes, we use 
precision, recall and Fl computed on the positive class. Note how a classifier 
that labels all data as 'negative' will have a high accuracy but 0 recall. 

3.4.1 Bootstmn^am^>lin^^x^erm and 
Tibshirani, 1 1994 is used in statistical estimation when the available labelled 
examples are small in number. We perform 100 randomized bootstrap 
sampling experiments. For each experiment, we first make two random splits 
of 60 and 40% of the data, such that the class ratio of 1:100 is maintained 
in both. The training set is constructed using a bootstrap sample from the 60% 
split and the test data from the 40% split. Precision and recall are computed 
for each experiment and then averaged over the 100 experiments. 

3.4.2 Precision-recall curves (PR) curves have been shown to give a 
more informative picture of an algo rithm's performanc e than ROC curves 
in the case of high class imbalance iDavis et The PR curve for 
every method was obtained by averaging over the 100 bootstrap sampling 
experiments. 

3.4.3 Leave-one-out cross-validation This technique, also called 'jack- 
knifing' is known to give an almost unbiased estimate of the true error, as 
it uses most of the available labelled data. It is a cross-validation technique 
where one instance is left out in each iteration and used for testing. Say 
the gold standard dataset contains P positive examples. We perform P 
experiments leaving out one positive and 100 negative examples in each, 
while training the model on the remaining P— 1 positives and (P— 1) x 100 
negatives. This maintains the class ratio in all experiments at 1:100. 

3.4.4 Parameter tuning To tune parameters, we perform a grid search over 
a range of possible parameter values and combinations of learning algorithms 
(SVM or LR) and type of regularization (l\ or I2). The parameters for SVM 
and LR are the tradeoff parameter k and the class-s pecific loss parameters C+ 
and C-. We did a 5-fold nested cross validation IVarma and Simod , l2006l) . 
the results are reported in the |Supplementary Material] The parameters that 
give best performance were chosen and fixed for all the random sampling 
and leave-one-out cross-validation (LOOCV) experiments. The same tuning 
technique was used for the alternate algorithms we compare against. 

4 RESULTS 

We evaluate our cross-species imputation technique by comparing 
it (i) with other imputation techniques and (ii) with other recent 
work in host-pathogen PPI prediction. Results are shown in Table[3] 
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for the two scenarios. For the 100 bootstrap sampling experiments, 
along with precision and recall averaged across 100 runs, we also 
show the standard deviation V. 

4.1 Comparison with other imputation methods 

Both classifier-induced and data-driven imputation methods were 
applied on the feature-set described in Table [2] RF imputation is a 
classifier induced imputation, which works during the training phase 
of the classifier. The data driven imputation methods, namely mean 
value and cross-species, work before the classifier. Subsequently, 
we apply two classifiers on this imputed data: SVM and LR. For 
each classifier, we found that t\ regularization outperforms I2 
regularization. We compared the following imputation approaches. 

• RF imput ation The implementation of RF by ( Breiman 
and Cutler. 1200417 imputes missing data by initializing each 
missing value to the mean of the feature and re-estimating 
values by building a forest and using nearest examples from 
the leaf nodes. 

• Mean-value imputation Here, the average value of every 
feature is used to impute missing values. Feature averages 
obtained on the training data were used to impute both 
training and test data. Another related technique is 'class- 
wise' or 'multiple' mean-value imputation, where the averages 
are computed separately for the two classes. We tried this 
technique and the results did not improve over mean-value 
imputation. 

• Cross species imputation This refers to our techniques 
described in Sections 12.31 and 3. We show results for two e- 
value thresholds: 0.01 and 0.0001 that define 'homologous' 
proteins in other species that will be used for imputation. 

• Deletion imputation We simply discard all data with missing 
values. This leaves us with 26 interacting pairs; we pick a 
random set of 2600 random pairs to be the negative class, 
such that there are no features with missing values. We find 
that l\ regularized LR, when applied on this reduced dataset 
performs better than SVMs or regularization. 

In Table[3] we see that LR with i\ regularization (LR LI) applied 
on the cross-species imputed data using a BLAST e-value threshold 
of 0.01 outperforms all other approaches, with an Fl score of 76.4. 
This is an improvement of about 7.2 points over the next best result 
of 69.2 achieved by LR LI with mean-value imputation. Note that 
deletion imputation that uses a much smaller dataset has a very 
high variance ((7^18) in its performance. It is very sensitive to the 
dataset used and will produce a model that overfits the data, which 
is undesirable. The LOOCV results show that our technique has 
an Fl score of 80.7, which is 9 points better than the mean-value 
imputation technique. The LOOCV performance for all techniques 
is better than the bootstrap sampling experiments, since they use 
the maximum available training data. The improvement obtained 
by our method is evident from the percent increase in Fl that we 
achieve over the other imputation methods: 18.4% over deletion, 
10.4% over mean-value and 46.9% over RF imputation. Permutation 
tests comparing our technique with all others show that the gains are 
statistically significant. 

Figure|2]shows the PR curves for all the imputation techniques. RF 
imputation and deletion have good precision but suffer in recall; the 



Table 3. First three columns show the performance of various techniques on 
precision, recall and Fl averaged over 100 bootstrap sampling experiments 
along with the standard deviation 'er'. The last three columns show the 
precision, recall and Fl obtained using LOOCV. Bold entries correspond 
to highest accuracy achieved across the various methods. 'RF', Random 
Forest, 'LR', Logistic Regression; 'LI' refers to the l\ regularizer. 
a Did not finish running 100 bootstrap runs, so averaged over 30 
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61.56 
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72.4 
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61 
a: 20.7 


64.5 
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67.4 
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Fig. 2. Precision-recall curve for various imputation techniques. 

opposite is true for Mean-val imputation. Cross species imputation 
however, does well on both precision and recall. All results are 
shown for positive to negative class skew of 1:100. We tried other 
class skews: 1:1, 1:50 and 1:150 and found that all techniques report 
higher accuracies for the lower class skews and the performance 
degrades gradually with increasing class skew. However, in all 
settings the ranking of relative performance remains the same: LI LR 
with cross-species imputation performs the best, with the difference 
in performance being more on higher class skews. In the case of 
equally balanced class ratio (i.e. 1:1), all the curves come very close 
to each other. 

4.2 Comparison with related work 

Two methods originally proposed to address the HIV-human PPI 
prediction problem were compared here. 
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Table 4. Percentage of examples with missing values in the Yersinia-human 
PPI dataset, for all features and separately for two features 



Type 


% Miss in 
positive 


% Miss in 
random 


All features 
Gene expression 
GO 


83% 
15-27% 
39-66% 


85% 
24-43% 
40-54% 


Table 5. Bootstrap sampling results on yer.rim'a-human PPI prediction 
comparing different imputation techniques 


Method 


Prec. Rec. Fl 


Mean Value 
Cross species 


36.8 42.4 39.1 
44.6 42.8 43.7 



Tastan et al. 2009 used RF wi th the built-in classifier-in duced 
feature imputation technique from dBreiman and Cutlell2004l) . Dyer 
et al. 2011 and assumed there are no missing values and applied. 
SVMs with a linear kernel. 

For both methods, we derived their features on our dataset and 
applied their techniques. The results in Table|3]show the superiority 
of our combination of feature set, classifier and imputation method 
over previous work on host-pathogen PPI prediction. 

4.3 Results on Yersinia-human PPI dataset 

We demonstrate the consistency of our techniques on Yersinia pestis- 
human PPI prediction. The d ataset comes from high-throughput 
experiments jDver et alWlOldh and has 4067 reported interactions. 
We build a negative class dataset using random pairs of proteins 
as described earlier, with a class ratio of 1:100. Features similar 
to the ones described in Table |2] were used. This PPI dataset also 
has several missing values as shown in Table |4] We apply the two 
best imputation techniques and show results for the 100 bootstrap- 
sampling experiments in Table [5] The overall accuracies are lower 
for this dataset as it comes from high-throughput experiments that 
tend to be noisy, and also because it has a larger proportion of missing 
values. 

5 CONCLUSION 

We presented an imputation technique for dealing with missing 
values that arise in PPI prediction tasks. The improvements we 
obtain on both host-pathogen PPI datasets over prior approaches 
that use generic imputation methods clearly show the superiority of 
our method. We further observe that the importance of the imputed 
features in the learned classifiers is significant, implying that their 
imputation is important for prediction. Although our current work 
shows improvements in accuracy, we believe that the resulting 
models which are based on the MAR missingness assumption are 
less biased by the imputation technique as compared to the other 
commonly used methods. Characterizing and measuring such model 
bias as a consequence of imputation will be an interesting direction 
for future study, as it will make model selection more principled. 
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